library(tidyverse)
library(haven)
data <- read_sas("/Users/sseals/Library/CloudStorage/GoogleDrive-sseals@uwf.edu/My Drive/STA6257/datasets/analysis1.sas7bdat")
head(data)

STA6235: Modeling in Regression
As a professional statistician, how do I know what modeling technique is appropriate?
I keep a “road map” in my mind:
Continuous? Try normal first. Maybe gamma if skewed.
Categorical? One of the logistics.
Count? Poisson or negative binomial.
Note that we have not talked about all modeling techniques.
Hurdle models.
Cox proportional hazards model.
Models for time series data.
Models for spatially correlated data.
Models for correlated data.
It helps when we “look” at the data.
Continuous data does not always lend itself to the normal distribution.
Sometimes it needs a transformation.
We can do a preliminary check on the outcome (y) to determine the shape of its distribution, but remember that the normality assumption is on the residuals, so we need to check those assumptions on the back end as well.
Categorical data needs to be explored to make sure there is no sparseness in any category.
Count data needs to be explored to verify that it is a count, determine if we should use either Poisson or negative binomial, and determine if we should consider a zero-inflated model.
Recall that we have been working with the Jackson Heart data.
Systolic and diastolic blood pressures are “known” to have normal distributions; however, it is still important for us to check.
If I think that a variable may be continuous, I will first look at a histogram of it to see what its shape is.
Note that the histogram will also allow us to identify any obvious outliers.
There is slight skew to systolic blood pressure.
We can also see that there are no “weird” values in the dataset.
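A sketch of these checks, assuming `data` comes from the `read_sas()` call earlier and that the variable names (`sbp`, `age`, `sex`) match those in the model summary shown below:

```r
library(tidyverse)

# Histogram of systolic blood pressure to check shape and spot outliers
data %>%
  ggplot(aes(x = sbp)) +
  geom_histogram(binwidth = 5, color = "black", fill = "gray") +
  labs(x = "Systolic Blood Pressure", y = "Count")

# Fit the model whose summary appears below (m1 is a name I chose here)
m1 <- lm(sbp ~ age + sex + age:sex, data = data)
summary(m1)
```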
Call:
lm(formula = sbp ~ age + sex + age:sex, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-43.416 -10.046  -1.145   8.374  66.148 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 98.42277    1.69377  58.109  < 2e-16 ***
age          0.49026    0.03023  16.216  < 2e-16 ***
sexMale      9.29023    2.79065   3.329 0.000883 ***
age:sexMale -0.11913    0.05060  -2.355 0.018619 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.62 on 2645 degrees of freedom
  (4 observations deleted due to missingness)
Multiple R-squared:  0.1203, Adjusted R-squared:  0.1193 
F-statistic: 120.5 on 3 and 2645 DF,  p-value: < 2.2e-16
Let’s now check the model assumptions.
Remember, when using the normal distribution, we assume that the residuals have a normal distribution with mean 0 and common variance.
We check this with the anova_check() function.
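A sketch of that check; I am assuming `anova_check()` comes from the course's helper package (`classpackage`), and `m1` is a name I chose for the fitted model from the summary above:

```r
library(classpackage)  # assumed source of the course's anova_check() helper

# Refit the model from the summary shown earlier
m1 <- lm(sbp ~ age + sex + age:sex, data = data)

# anova_check() produces residual diagnostics for the normality
# and constant-variance assumptions
anova_check(m1)
```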
Let us consider diabetic status (Diabetes; 0=non-diabetic, 1=diabetic).
In this case, it is a yes/no response, so I know that binary logistic regression is appropriate.
However, when working with categorical data, I look at frequency tables.
If I need a quick look at frequencies, I use the count() function:
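For example, assuming the diabetes indicator is stored in `data` under the name `Diabetes` as described above:

```r
library(tidyverse)

# Frequency table for diabetic status; look for sparse categories
data %>%
  count(Diabetes) %>%
  mutate(prop = n / sum(n))  # proportions alongside the raw counts
```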
What about the three level categorization of diabetic status (diab3cat; 0=non-diabetic, 1=pre-diabetic, 2=diabetic)?
Using the same method as before, we will confirm the categories and eyeball sample sizes.
We see that there are 3 levels of diab3cat and from the description, that 0=non-diabetic, 1=pre-diabetic, 2=diabetic.
There is a “linear” progression from non-diabetic to pre-diabetic to diabetic… ordinal logistic should be considered.
The Brant test can be used to determine if ordinal logistic regression is valid.
Remember that if ordinal logistic regression is not valid, we can step back and perform nominal logistic regression.
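A sketch of the ordinal fit and the Brant test, assuming the `MASS` and `brant` packages; the predictors (`age`, `sex`) are placeholder choices, not taken from the notes above:

```r
library(MASS)   # polr() fits ordinal (proportional odds) logistic regression
library(brant)  # brant() tests the proportional odds assumption

# polr() needs the outcome as a factor; levels follow the coding above
m_ord <- polr(factor(diab3cat, levels = c(0, 1, 2)) ~ age + sex,
              data = data, Hess = TRUE)

# Large p-values support proportional odds (ordinal logistic is valid);
# small p-values suggest stepping back to nominal logistic regression
brant(m_ord)
```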
You were asked to create a variable that counts the number of controllable risk factors for stroke:
With count data, we also need to consider zero-inflation.
We can see from the frequency chart that we do not have any 0 values at all, so that is not a concern.
However, if we did have 0 values, we could look at a histogram to see whether there is a spike at zero:
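A self-contained illustration with simulated data (not the Jackson Heart data): mixing a point mass at 0 with a Poisson(2) produces the kind of zero spike we would be looking for.

```r
library(tidyverse)
set.seed(6235)

# Zero-inflated counts: 40% structural zeros mixed with Poisson(2)
y <- ifelse(rbinom(1000, 1, 0.4) == 1, 0L, rpois(1000, lambda = 2))

# The histogram shows a spike at 0 beyond what Poisson alone predicts
tibble(y = y) %>%
  ggplot(aes(x = y)) +
  geom_histogram(binwidth = 1, color = "black", fill = "gray")

mean(y == 0)         # observed proportion of zeros
dpois(0, mean(y))    # proportion a Poisson with the same mean predicts
```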
We also want to decide between Poisson and negative binomial.
Remember that the assumption of Poisson is that the mean equals the variance. That’s easy to check!
Poisson will be fine here, but we could use negative binomial if we wanted to.
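A sketch of the mean-variance check and both fits; the notes above do not give the count variable's actual name, so `risk_count` and the predictors here are placeholders:

```r
library(MASS)  # glm.nb() fits negative binomial regression

# Compare the mean and variance of the count outcome (placeholder name);
# variance well above the mean suggests negative binomial
mean(data$risk_count, na.rm = TRUE)
var(data$risk_count, na.rm = TRUE)

# Poisson fit
m_pois <- glm(risk_count ~ age + sex, data = data, family = poisson)

# Negative binomial fit, in case of overdispersion
m_nb <- glm.nb(risk_count ~ age + sex, data = data)
```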
We are going to practice determining what distribution is appropriate. Remember that we have only learned the following models:
I often say that I spend more time doing data management and investigation than I do actually analyzing the data.
I also say that statisticians/data scientists are often more detectives than anything.
Do not be afraid to ask questions of those who are giving you data or are responsible for data creation/collection.
In the best scenarios, there will be data dictionaries available.
In the worst scenarios, there will be no answers to your questions.
In all scenarios, do the best with what you have, keep notes, and report whatever assumptions you make about the data for modeling purposes.